Lesson 1: Working with Gridded Spatial Data in Python¶
Objective: Introduce packages for working with gridded spatial data in Python and learn how to use them to manipulate spatial data. We will work with multidimensional gridded data in xarray and perform geospatial operations on xarrays using rioxarray.
Step 1. Load the necessary libraries¶
import fsspec #connecting to data on aws
import warnings #don't print warnings
warnings.filterwarnings('ignore')
import xarray as xr #for gridded data
import numpy as np #for arrays in python
from dask.diagnostics import ProgressBar #progress bar
Step 2. Gridded data with xarray¶
Xarray is a package for working with multidimensional gridded data in Python. While the package numpy provides many of the core operations we need for working with gridded data (indexing, matrix operations, etc.), it does not provide the functionality to name the dimensions of arrays, attach coordinates to grid cells, or store important metadata. This is where xarray comes in.
By including labels on array dimensions xarray opens up many new possibilities:
applying operations over dimensions by name: x.sum('time').
selecting values by label: x.sel(time='2014-01-01').
using the split-apply-combine paradigm with groupby: x.groupby('time.dayofyear').mean().
keeping track of arbitrary metadata in the form of a Python dictionary: x.attrs.
and much more
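A minimal sketch of these labeled operations on a toy array (the same kind of DataArray is built step by step below):

```python
import numpy as np
import xarray as xr

# a toy 2x3 array with named dimensions and coordinates
da = xr.DataArray(
    np.array([[1, 2, 3], [4, 5, 6]]),
    dims=("x", "y"),
    coords={"x": [10, 20], "y": [1.1, 1.2, 1.3]},
)

# reduce over a dimension by name -- no axis numbers to remember
col_sums = da.sum("x")        # values [5, 7, 9]
# select by coordinate label rather than integer position
row = da.sel(x=20)            # values [4, 5, 6]
# attach arbitrary metadata
da.attrs["units"] = "counts"
```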
The xarray data structure makes it trivial to go from 2 to 3 to 4 to N dimensions, which makes it a great choice for gridded data, where we typically have at least three dimensions (lat, lon, time). Another big benefit is that it integrates seamlessly with Dask, a popular library for parallel computing in Python. This allows us to scale xarray analyses to very large data.
The core data structure of xarray is the xarray.DataArray, which in its simplest form is just a NumPy array with named dimensions and coordinates on those dimensions. We can combine multiple xarray.DataArray objects in a single structure called an xarray.Dataset. Let's see what this looks like
#create a 2x3 np array
arr = np.array([[1,2,3],[5,6,7]])
#create a xarray.DataArray by naming the dims and giving them coordinates
xda = xr.DataArray(arr,
dims=("x", "y"),
coords={"x": [10, 20],
"y": [1.1,1.2,1.3]})
xda
<xarray.DataArray (x: 2, y: 3)> Size: 48B
array([[1, 2, 3],
[5, 6, 7]])
Coordinates:
* x (x) int64 16B 10 20
* y        (y) float64 24B 1.1 1.2 1.3
We can access the individual components, like the data itself, the dimension names, or the coordinates, using accessors
#get the underlying array/matrix
print(xda.values)
#get the dimension names
print(xda.dims)
#get the x coordinates
print(xda.coords['x'])
[[1 2 3]
[5 6 7]]
('x', 'y')
<xarray.DataArray 'x' (x: 2)> Size: 16B
array([10, 20])
Coordinates:
* x (x) int64 16B 10 20
We can set or get any metadata attribute we like
xda.attrs["long_name"] = "random measurement"
xda.attrs["random_attribute"] = 123
print(xda.attrs)
{'long_name': 'random measurement', 'random_attribute': 123}
and perform calculations on xarray.DataArrays as if they were numpy arrays
xda + 10
<xarray.DataArray (x: 2, y: 3)> Size: 48B
array([[11, 12, 13],
[15, 16, 17]])
Coordinates:
* x (x) int64 16B 10 20
* y (y) float64 24B 1.1 1.2 1.3
Attributes:
long_name: random measurement
random_attribute: 123
np.sin(xda)
<xarray.DataArray (x: 2, y: 3)> Size: 48B
array([[ 0.84147098, 0.90929743, 0.14112001],
[-0.95892427, -0.2794155 , 0.6569866 ]])
Coordinates:
* x (x) int64 16B 10 20
* y (y) float64 24B 1.1 1.2 1.3
Attributes:
long_name: random measurement
random_attribute: 123
An xarray.Dataset is a container of multiple aligned DataArray objects
#create a new dataarray with aligned dimensions (but it can be more or fewer dims)
#create a new 2x3x4 xarray Dataarray
arr2 = np.random.randn(2, 3, 4)
xda2 = xr.DataArray(arr2,
dims=("x", "y","z"),
coords={"x": [10, 20],
"y": [1.1,1.2,1.3],
"z": [20,200,2000,20000]})
#combine with another xarray.DataArray to make a xarray.Dataset
xds = xr.Dataset({'foo':xda,'bar':xda2})
xds
<xarray.Dataset> Size: 312B
Dimensions: (x: 2, y: 3, z: 4)
Coordinates:
* x (x) int64 16B 10 20
* y (y) float64 24B 1.1 1.2 1.3
* z (z) int64 32B 20 200 2000 20000
Data variables:
foo (x, y) int64 48B 1 2 3 5 6 7
bar      (x, y, z) float64 192B 0.3587 0.8288 0.7981 ... -0.2103 -0.05131
Here you can see that we have multiple arrays in a single dataset. Xarray automatically aligns the arrays based on shared dimensions and coordinates. You can do almost everything you can do with DataArray objects with Dataset objects (including indexing and arithmetic), if you prefer to work with multiple variables at once. You can also easily retrieve a single DataArray by name from a Dataset
xds.foo
# xds['foo'] works the same
<xarray.DataArray 'foo' (x: 2, y: 3)> Size: 48B
array([[1, 2, 3],
[5, 6, 7]])
Coordinates:
* x (x) int64 16B 10 20
* y (y) float64 24B 1.1 1.2 1.3
Attributes:
long_name: random measurement
random_attribute: 123
Terminology¶
It is important to be precise with our terminology when dealing with xarray objects, as things can quickly get confusing when working with many dimensions. The full glossary can be found here, but a quick recap:
- xarray.DataArray - A multi-dimensional array with labeled or named dimensions
- xarray.Dataset - A collection of DataArrays with aligned dimensions
- Dimension - The (named) axes of an array
- Coordinate - An array that labels a dimension
Step 3. Loading data from the cloud¶
Xarray supports reading and writing of several file formats, from simple Pickle files to the more flexible netCDF format, and the cloud-optimized zarr format. When we are working with complex multidimensional data, file formats start to matter a lot, and they make a big difference to how fast and efficiently we can load and analyse data. More on this in the next lesson.
We can load files to create a new Dataset using open_dataset(). Similarly, a DataArray can be saved to disk using the DataArray.to_netcdf() or DataArray.to_zarr() methods.
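A minimal sketch of this round trip, assuming scipy is installed so we can use its classic netCDF3 writer (the file name is arbitrary):

```python
import os
import tempfile

import numpy as np
import xarray as xr

# a small Dataset to write out and read back
# (int32 coords, because the netCDF3 format has no int64 type)
ds = xr.Dataset(
    {"foo": (("x",), np.arange(4.0))},
    coords={"x": np.arange(4, dtype="int32")},
)

path = os.path.join(tempfile.mkdtemp(), "demo.nc")
# the scipy engine writes classic netCDF3 files without the netCDF4 C library
ds.to_netcdf(path, engine="scipy")

# read it back into a new Dataset
ds2 = xr.open_dataset(path, engine="scipy")
```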
We can easily work with datasets stored on our local hard drive using xarray, but we are limited by two key constraints:
- The data must fit on our hard disk.
- The data must fit in our system's RAM.
While this is sufficient for many tasks, it imposes significant limitations on the size of the data we can handle. For example, if we need to analyze multiple satellite images or large datasets from climate models, we may quickly reach these limits.
Cloud-based data analysis offers a solution to these constraints. Cloud storage is inexpensive and effectively infinitely scalable. Additionally, we can dynamically scale up the compute power, such as increasing the amount of RAM or the number of CPUs, when required for resource-intensive tasks. This model of connecting scalable compute with virtually unlimited cloud data storage opens up new possibilities for working with large, gridded datasets. It is also very cost-effective: processing 1 TB of data can cost on the order of 0.1 USD.
Amazon Web Services (AWS) is one such cloud platform that facilitates this model. AWS SageMaker provides scalable compute resources, while AWS S3 offers scalable storage.
In the following example, we will demonstrate how to connect to a dataset stored in S3 and open it using xarray. The dataset we will use is satellite-derived Sea Surface Temperature.
#correctly format the path to the data on AWS s3
s3path = fsspec.get_mapper('s3://mur-sst/zarr', anon=True)
#open data
ds_sst = xr.open_dataset(s3path,
engine='zarr',
chunks='auto')
ds_sst
<xarray.Dataset> Size: 104TB
Dimensions: (time: 6443, lat: 17999, lon: 36000)
Coordinates:
* time (time) datetime64[ns] 52kB 2002-06-01T09:00:00 ... 2020...
* lat (lat) float32 72kB -89.99 -89.98 -89.97 ... 89.98 89.99
* lon (lon) float32 144kB -180.0 -180.0 -180.0 ... 180.0 180.0
Data variables:
analysed_sst (time, lat, lon) float64 33TB dask.array<chunksize=(4225, 63, 63), meta=np.ndarray>
analysis_error (time, lat, lon) float64 33TB dask.array<chunksize=(4225, 63, 63), meta=np.ndarray>
mask (time, lat, lon) int8 4TB dask.array<chunksize=(6443, 100, 100), meta=np.ndarray>
sea_ice_fraction (time, lat, lon) float64 33TB dask.array<chunksize=(4225, 63, 63), meta=np.ndarray>
Attributes: (12/47)
Conventions: CF-1.7
Metadata_Conventions: Unidata Observation Dataset v1.0
acknowledgment: Please acknowledge the use of these data with...
cdm_data_type: grid
comment: MUR = "Multi-scale Ultra-high Resolution"
creator_email: ghrsst@podaac.jpl.nasa.gov
... ...
summary: A merged, multi-sensor L4 Foundation SST anal...
time_coverage_end: 20200116T210000Z
time_coverage_start: 20200115T210000Z
title: Daily MUR SST, Final product
uuid: 27665bc0-d5fc-11e1-9b23-0800200c9a66
westernmost_longitude: -180.0
Chunks?¶
When opening our data we can specify that we want the data split into chunks along each dimension
What does this do, and why should we do it?¶
If you don't specify that you want the dataset chunked, xarray will load all the data into a numpy array. This can be okay if you are working with a small dataset, but as your data grows larger, chunking has a number of advantages:
Efficient memory usage: Without chunking, xarray loads the entire dataset into memory as NumPy arrays, which can use a lot of RAM and may cause your system to slow down or crash. Chunking splits the data into smaller pieces, allowing you to work with datasets that are bigger than your available memory by loading only what you need.
Better performance: Processing smaller chunks can speed up computations and make data handling more efficient. Data is loaded into memory only when required, reducing unnecessary memory usage and improving processing speed.
Default chunking and rechunking¶
Some file types, like netCDF, zarr, or cloud-optimized GeoTIFF, have native chunking, and it is usually most efficient to use the chunking that is already present. If you specify chunks='auto', chunking will be determined automatically. Using the native chunking is a major advantage, as rechunking can be expensive for large files. The downside is that you are subject to the chunking chosen by the creator of the file.
Check out the dask documentation on chunks to find out more about chunking.
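As a small, self-contained illustration of chunking with a toy in-memory array (assuming dask is installed, as it is in this lesson):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(24.0).reshape(4, 6), dims=("time", "x"))

# split into chunks of 2 along time and 3 along x; nothing is computed yet
chunked = da.chunk({"time": 2, "x": 3})
print(chunked.chunks)   # ((2, 2), (3, 3))

# operations on a chunked array build a lazy task graph...
lazy_mean = chunked.mean("x")
# ...which only runs when we ask for the result
result = lazy_mean.compute()
```

Calling .chunk() again with different sizes rechunks the array, which is cheap here but can be expensive for large on-disk data.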
Indexing, selecting and masking¶
While you can use numpy-like indexing, e.g. da[:,:], this does not make use of the power of having named dims and coords. Xarray has specific methods for selecting using the position in the array, .isel(), and using the coordinates, .sel()
#indexing using position
ds_sst.isel(lon=20,lat=20)
<xarray.Dataset> Size: 213kB
Dimensions: (time: 6443)
Coordinates:
* time (time) datetime64[ns] 52kB 2002-06-01T09:00:00 ... 2020...
lat float32 4B -89.79
lon float32 4B -179.8
Data variables:
analysed_sst (time) float64 52kB dask.array<chunksize=(4225,), meta=np.ndarray>
analysis_error (time) float64 52kB dask.array<chunksize=(4225,), meta=np.ndarray>
mask (time) int8 6kB dask.array<chunksize=(6443,), meta=np.ndarray>
sea_ice_fraction (time) float64 52kB dask.array<chunksize=(4225,), meta=np.ndarray>
Attributes: (12/47)
Conventions: CF-1.7
Metadata_Conventions: Unidata Observation Dataset v1.0
acknowledgment: Please acknowledge the use of these data with...
cdm_data_type: grid
comment: MUR = "Multi-scale Ultra-high Resolution"
creator_email: ghrsst@podaac.jpl.nasa.gov
... ...
summary: A merged, multi-sensor L4 Foundation SST anal...
time_coverage_end: 20200116T210000Z
time_coverage_start: 20200115T210000Z
title: Daily MUR SST, Final product
uuid: 27665bc0-d5fc-11e1-9b23-0800200c9a66
westernmost_longitude: -180.0
We can use all the same techniques, but provide coordinate values rather than positions if we use .sel(). We can also provide an option for what to do if we do not get an exact match to the provided coordinates.
ds_sst.sel(lon=-10,lat=-10,method='nearest')
<xarray.Dataset> Size: 213kB
Dimensions: (time: 6443)
Coordinates:
* time (time) datetime64[ns] 52kB 2002-06-01T09:00:00 ... 2020...
lat float32 4B -10.0
lon float32 4B -10.0
Data variables:
analysed_sst (time) float64 52kB dask.array<chunksize=(4225,), meta=np.ndarray>
analysis_error (time) float64 52kB dask.array<chunksize=(4225,), meta=np.ndarray>
mask (time) int8 6kB dask.array<chunksize=(6443,), meta=np.ndarray>
sea_ice_fraction (time) float64 52kB dask.array<chunksize=(4225,), meta=np.ndarray>
Attributes: (12/47)
Conventions: CF-1.7
Metadata_Conventions: Unidata Observation Dataset v1.0
acknowledgment: Please acknowledge the use of these data with...
cdm_data_type: grid
comment: MUR = "Multi-scale Ultra-high Resolution"
creator_email: ghrsst@podaac.jpl.nasa.gov
... ...
summary: A merged, multi-sensor L4 Foundation SST anal...
time_coverage_end: 20200116T210000Z
time_coverage_start: 20200115T210000Z
title: Daily MUR SST, Final product
uuid: 27665bc0-d5fc-11e1-9b23-0800200c9a66
westernmost_longitude: -180.0
We can select contiguous segments using slice
ds_sst = ds_sst.sel(lon=slice(-20,-19),lat=slice(-10,-9))
We can mask values in our array using conditions based on the array values or coordinate values with .where()
# keep only observations from 2019 onwards
ds_sst_2019 = ds_sst.where(ds_sst.time >= np.datetime64('2019-01-01'), drop=True)
ds_sst_2019
<xarray.Dataset> Size: 110MB
Dimensions: (time: 385, lat: 101, lon: 101)
Coordinates:
* time (time) datetime64[ns] 3kB 2019-01-01T09:00:00 ... 2020-...
* lat (lat) float32 404B -10.0 -9.99 -9.98 ... -9.02 -9.01 -9.0
* lon (lon) float32 404B -20.0 -19.99 -19.98 ... -19.01 -19.0
Data variables:
analysed_sst (time, lat, lon) float64 31MB dask.array<chunksize=(385, 2, 3), meta=np.ndarray>
analysis_error (time, lat, lon) float64 31MB dask.array<chunksize=(385, 2, 3), meta=np.ndarray>
mask (time, lat, lon) float32 16MB dask.array<chunksize=(385, 1, 1), meta=np.ndarray>
sea_ice_fraction (time, lat, lon) float64 31MB dask.array<chunksize=(385, 2, 3), meta=np.ndarray>
Attributes: (12/47)
Conventions: CF-1.7
Metadata_Conventions: Unidata Observation Dataset v1.0
acknowledgment: Please acknowledge the use of these data with...
cdm_data_type: grid
comment: MUR = "Multi-scale Ultra-high Resolution"
creator_email: ghrsst@podaac.jpl.nasa.gov
... ...
summary: A merged, multi-sensor L4 Foundation SST anal...
time_coverage_end: 20200116T210000Z
time_coverage_start: 20200115T210000Z
title: Daily MUR SST, Final product
uuid: 27665bc0-d5fc-11e1-9b23-0800200c9a66
westernmost_longitude: -180.0
xarray has lots of functionality and allows you to do most of the common operations you need for gridded data. For example, grouping and aggregation:
ds_sst_2019 = ds_sst_2019.groupby('time.month').mean()
ds_sst_2019
<xarray.Dataset> Size: 3MB
Dimensions: (month: 12, lat: 101, lon: 101)
Coordinates:
* month (month) int64 96B 1 2 3 4 5 6 7 8 9 10 11 12
* lat (lat) float32 404B -10.0 -9.99 -9.98 ... -9.02 -9.01 -9.0
* lon (lon) float32 404B -20.0 -19.99 -19.98 ... -19.01 -19.0
Data variables:
analysed_sst (month, lat, lon) float64 979kB dask.array<chunksize=(1, 2, 3), meta=np.ndarray>
analysis_error (month, lat, lon) float64 979kB dask.array<chunksize=(1, 2, 3), meta=np.ndarray>
mask (month, lat, lon) float32 490kB dask.array<chunksize=(1, 1, 1), meta=np.ndarray>
sea_ice_fraction (month, lat, lon) float64 979kB dask.array<chunksize=(1, 2, 3), meta=np.ndarray>
Attributes: (12/47)
Conventions: CF-1.7
Metadata_Conventions: Unidata Observation Dataset v1.0
acknowledgment: Please acknowledge the use of these data with...
cdm_data_type: grid
comment: MUR = "Multi-scale Ultra-high Resolution"
creator_email: ghrsst@podaac.jpl.nasa.gov
... ...
summary: A merged, multi-sensor L4 Foundation SST anal...
time_coverage_end: 20200116T210000Z
time_coverage_start: 20200115T210000Z
title: Daily MUR SST, Final product
uuid: 27665bc0-d5fc-11e1-9b23-0800200c9a66
westernmost_longitude: -180.0
⚠️ NOTE: You may notice that xarray code often takes almost no time at all to run. This is because for many functions xarray does not load data from disk and actually perform the calculation; instead, it simply prints a summary and high-level overview of the data that will be produced. This is called **lazy computation** and is the smart thing to do when working with large datasets. The calculation only happens when you really need the result, like when calling .plot() or writing results to disk. We can force computation by running **.compute()**
with ProgressBar():
ds_sst_2019 = ds_sst_2019.compute()
[########################################] | 100% Completed | 7.76 s
ds_sst_2019
<xarray.Dataset> Size: 3MB
Dimensions: (month: 12, lat: 101, lon: 101)
Coordinates:
* month (month) int64 96B 1 2 3 4 5 6 7 8 9 10 11 12
* lat (lat) float32 404B -10.0 -9.99 -9.98 ... -9.02 -9.01 -9.0
* lon (lon) float32 404B -20.0 -19.99 -19.98 ... -19.01 -19.0
Data variables:
analysed_sst (month, lat, lon) float64 979kB 299.8 299.8 ... 299.3
analysis_error (month, lat, lon) float64 979kB 0.3698 0.3698 ... 0.3748
mask (month, lat, lon) float32 490kB 1.0 1.0 1.0 ... 1.0 1.0
sea_ice_fraction (month, lat, lon) float64 979kB -1.28 -1.28 ... -1.28
Attributes: (12/47)
Conventions: CF-1.7
Metadata_Conventions: Unidata Observation Dataset v1.0
acknowledgment: Please acknowledge the use of these data with...
cdm_data_type: grid
comment: MUR = "Multi-scale Ultra-high Resolution"
creator_email: ghrsst@podaac.jpl.nasa.gov
... ...
summary: A merged, multi-sensor L4 Foundation SST anal...
time_coverage_end: 20200116T210000Z
time_coverage_start: 20200115T210000Z
title: Daily MUR SST, Final product
uuid: 27665bc0-d5fc-11e1-9b23-0800200c9a66
westernmost_longitude: -180.0
Step 4. Make xarray geospatial with rioxarray¶
Although we have latitude and longitude values associated with our xarray dataset, it is not a proper geospatial dataset, and hence we cannot do spatial manipulations like calculating distances or reprojecting. Xarray is a general-purpose tool for any multidimensional data and is not specific to geospatial data. We need an additional package, rioxarray, which brings the power of GDAL to xarray. rioxarray extends xarray with the rio accessor, meaning that a set of new functions becomes available on xarray objects by typing .rio. It also allows us to open geospatial datasets, like GeoTIFFs, using xr.open_dataset(..., engine='rasterio')
import rioxarray
We can load a cloud-optimized GeoTIFF stored on AWS by directly providing the URL of the file location. This specific file is a single band from the Sentinel-2 satellite
s2 = xr.open_dataset('https://sentinel-cogs.s3.us-west-2.amazonaws.com/sentinel-s2-l2a-cogs/34/H/CH/2018/9/S2A_34HCH_20180923_0_L2A/B08.tif',
engine='rasterio',
chunks='auto')
s2
<xarray.Dataset> Size: 482MB
Dimensions: (band: 1, x: 10980, y: 10980)
Coordinates:
* band (band) int64 8B 1
* x (x) float64 88kB 3e+05 3e+05 3e+05 ... 4.098e+05 4.098e+05
* y (y) float64 88kB 6.3e+06 6.3e+06 6.3e+06 ... 6.19e+06 6.19e+06
spatial_ref int64 8B ...
Data variables:
band_data    (band, y, x) float32 482MB dask.array<chunksize=(1, 5120, 5120), meta=np.ndarray>
Because this file has projection information associated with it, we can perform geospatial operations on it, like .rio.clip()
geometries = [
{
'type': 'Polygon',
'coordinates': [[
[300115, 6250015],
[310415, 6260015],
[320815, 6260015],
[310415, 6250015],
[300215, 6240015]
]]
}
]
clipped = s2.rio.clip(geometries)
clipped
<xarray.Dataset> Size: 17MB
Dimensions: (band: 1, x: 2070, y: 2000)
Coordinates:
* band (band) int64 8B 1
* x (x) float64 17kB 3.001e+05 3.001e+05 ... 3.208e+05 3.208e+05
* y (y) float64 16kB 6.26e+06 6.26e+06 ... 6.24e+06 6.24e+06
spatial_ref int64 8B 0
Data variables:
band_data    (band, y, x) float32 17MB dask.array<chunksize=(1, 1118, 2070), meta=np.ndarray>
or reproject
clipped = clipped.rio.reproject('epsg:4326')
clipped
<xarray.Dataset> Size: 17MB
Dimensions: (x: 2292, y: 1850, band: 1)
Coordinates:
* x (x) float64 18kB 18.84 18.84 18.84 18.84 ... 19.06 19.06 19.06
* y (y) float64 15kB -33.78 -33.78 -33.78 ... -33.96 -33.96 -33.97
* band (band) int64 8B 1
spatial_ref int64 8B 0
Data variables:
band_data    (band, y, x) float32 17MB nan nan nan nan ... nan nan nan nan
We can also plot it on top of other spatial data. Here we will overlay it with a satellite basemap. We will use the package hvplot to make this plot interactive, allowing us to pan and zoom.
#plotting
import hvplot.xarray
import holoviews as hv
hvplot.extension('bokeh')
#plot with a satellite basemap
clip_plot = clipped['band_data'].isel(band=0)
#plot
clip_plot.hvplot(tiles=hv.element.tiles.EsriImagery(),
project=True,clim=(1,10000),
cmap='magma',frame_width=800,data_aspect=1,alpha=0.7,title='Sentinel 2 near-infrared')
Finally, if you want to save your file as a cloud-optimized GeoTIFF, you can use the .rio.to_raster() method and specify COG as the driver parameter
⚠️ NOTE: THE FOLLOWING CODE BLOCK SAVES THE RASTER TO A CLOUD-OPTIMIZED GEOTIFF IN S3. FOR SECURITY REASONS, WE HAVE REMOVED 'WRITE' ACCESS TO THE PUBLIC BUCKET USED IN THIS TRAINING. YOU CAN USE THIS CODE TO SAVE TO YOUR OWN S3 BUCKET WITHIN TNC's AWS ACCOUNT
import os
# GDAL's /vsis3/ handler does not support random-write operations (like updating existing files or certain GeoTIFF writing patterns) by default.
# To allow this, we must instruct GDAL to stage the write in a local temporary file before uploading to S3
# Set the environment variable
os.environ["CPL_VSIL_USE_TEMP_FILE_FOR_RANDOM_WRITE"] = "YES"
#s3_cog_path = "/vsis3/your-bucket-name/path/to/output_cog.tif"
s3_path = "/vsis3/ocs-training-2026/advanced/s2_nir_cog.tif"
clip_plot.rio.to_raster(raster_path=s3_path, driver="COG")
Step 5. Data search and discovery¶
How can we find and discover datasets stored in the cloud? To address this challenge, a number of metadata standards have been developed to help organize and document datasets, making them easier to find, query, and use.
These metadata standards serve as structured descriptions of data, allowing us to efficiently search through catalogs and discover datasets that meet specific criteria, such as geographic location, time range, or data type. Two widely used examples of these metadata standards are STAC (SpatioTemporal Asset Catalog) and intake.
STAC (SpatioTemporal Asset Catalog) is a specification designed for geospatial data, such as satellite imagery or climate model outputs. It organizes data by providing a uniform structure to describe assets (e.g., satellite images) with spatial and temporal metadata. This enables users to efficiently search for and filter relevant data across large, distributed datasets in the cloud.
intake is a more general-purpose library that helps manage and access data catalogs in Python. It supports various data formats and types, allowing users to interact with both local and remote datasets in a unified way. With intake, we can browse, search, and load datasets without needing to worry about their underlying storage format or location.
These tools and standards empower users to seamlessly navigate through vast amounts of cloud-hosted data and extract just what is needed for their analyses. By leveraging these metadata-driven catalogs, we can make cloud data discovery efficient, even when dealing with enormous and complex datasets.
Even with robust metadata standards like STAC and intake, we still need to know where to find the catalogs that contain the datasets we’re interested in. This can sometimes be a challenging task, especially given the vast amount of data available across different platforms and cloud providers. However, there are several excellent starting points for discovering cloud-hosted datasets, particularly those related to Earth observation, geospatial analysis, and open data.
Some key resources include:
AWS Earth: Amazon Web Services (AWS) hosts a wide range of Earth observation data, making it accessible for analysis in the cloud. The AWS Earth page highlights various datasets related to satellite imagery, weather, and environmental monitoring. It also includes case studies and tools for working with these datasets. This is a great resource for those seeking publicly accessible datasets related to Earth science.
AWS Open Data Registry: AWS maintains an extensive Open Data Registry, which catalogs a wide variety of public datasets across different fields, including geospatial data, climate science, genomics, and more. The registry provides detailed information about each dataset, including links to the data on AWS S3, metadata, and documentation. This resource is particularly useful for discovering datasets that are freely accessible for cloud-based analysis.
NASA Earthdata: NASA Earthdata provides access to a vast collection of Earth science data, particularly data collected by NASA's satellites and field measurement programs. The platform offers powerful search tools, including the Earthdata Search tool, which allows users to filter and download datasets based on specific criteria like spatial and temporal coverage, data type, and more. NASA Earthdata is a go-to source for anyone working on climate, weather, land cover, and atmospheric studies, with extensive documentation and tutorials available to help users get started. Much of NASA's data is already on AWS, so using it is simply a matter of finding the URL for the data you want.
Radiant Earth STAC Browser: The STAC Browser is an interactive web-based tool that allows users to browse STAC-compliant datasets. It provides a user-friendly interface to search for geospatial datasets cataloged using the STAC standard. Radiant Earth is focused on providing open geospatial data for machine learning and Earth observation applications, making this a valuable resource for researchers in these fields.
Despite these resources, discovering the right datasets can still require some trial and error, especially when dealing with specialized or niche datasets. It's important to explore these platforms, understand the types of data available, and take advantage of the metadata standards and search tools they provide to refine your search.
As cloud data storage grows and standards evolve, the process of discovering and accessing large, cloud-hosted datasets will continue to improve, making it easier to find the data you need for complex analyses.
Conclusions¶
We've demonstrated the following concepts in this workbook:
- Use xarray to efficiently run operations on multidimensional gridded datasets
- Load data from a public cloud data repository, such as data stored in AWS S3
- Use rioxarray to add geospatial capabilities and spatial data operations to your xarray dataset
- Take advantage of several cloud data catalogs to pull data into your workflows and avoid downloads
Additional Resources¶
Great places to learn more about working with gridded data in Python:
- The Carpentries Geospatial Python lesson by Ryan Avery
- The xarray user guide
- An Introduction to Earth and Environmental Data Science
- AWS Skill Builder: This training portal provided by AWS contains self-paced training modules for all of AWS' cloud storage and compute services. While some courses are behind a paywall, many of the introductory courses are free to access. Use the web application's filtering function to focus your search, for example to Free courses of the Fundamental skill level focused on Data analytics